Welcome! The purpose of this project is to produce an algorithm (model) that can help the online real-estate website, Zillow.com, predict home sale prices with greater accuracy. Improving real-estate valuations are important for several reasons. Namely it:
To gather data, our team focused on Mecklenberg County, NC (Charlotte Metro Area) and sourced information from the county’s open data website, as well as the American Community Survey and U.S.Census.
To begin, we will import a home sales dataset that includes variables like location, housing characteristics, and home quality for the Charlotte Metro Area. After, we will ‘clean’ our data by creating useful columns such as, “building grade” as a numeric value (where higher values correspond to greater quality), “age of home (age)”, “price per square foot (pricesqft)” and calculating the # of “total baths (totbaths)” by joining full and half-bathroom information. Moving forward, we’ll refer to this home sales data as “internal variables.”
#Import sales data, remove columns we won't need, add useful columns, set CRS for North Carolina
library(dplyr)
library(sf)
CLT_internal <-
st_read("https://github.com/mafichman/MUSA_508_Lab/raw/main/Midterm/data/2022/studentData.geojson") %>%
st_transform('ESRI:103500')
CLT_internal <- CLT_internal[c(5,9,20,21,26,28,30:46,57:60,67,68,70,71,72)] %>%
mutate(
age = 2022 - yearbuilt,
sqft = (totalac * 43560),
pricesqft = ((totalac * 43560)/price),
totbaths = (halfbaths*0.5)+(fullbaths))
CLT_internal$quality <- recode(CLT_internal$bldggrade, MINIMUM = 1 , FAIR = 2, AVERAGE = 3, GOOD = 4, VERYGOOD = 5, EXCELLENT = 6, CUSTOM = 7)
To build a strong algorithm (model), it’s important to include variables that relate to the housing market such as local schools, grocery stores, and parks. We’ll refer to these variables as “amenities.”
# Adding school data
CLT_schools <-
st_read("https://github.com/mtdunst/Zestimate-Project/raw/main/Schools.geojson") %>%
st_transform(st_crs(CLT_internal))
# Adding grocery store data
CLT_grocery <-
st_read("Grocery_pts.geojson") %>%
st_transform(st_crs(CLT_internal))
# Adding parks data
CLT_parks <-
st_read("https://github.com/mtdunst/Zestimate-Project/raw/main/Parks.geojson") %>%
st_transform(st_crs(CLT_internal))
Finally, we will add variables that provide demographic and environmental data for the Charlotte Metro Area. Specifically, we will include educational attainment, household income, and neighborhoods data. We’ll refer to these variables as “spatial structure.”
# Adding demographic data
CLT_demo <-
get_acs(geography = "tract",
variables = c("B19013_001E", "B15003_022E","B15003_001E"),
year=2020, state=37, county=119,
geometry=TRUE, output="wide") %>%
st_transform(st_crs(CLT_internal)) %>%
dplyr::select( -NAME, -B19013_001M, -B15003_022M, -B15003_001M)
CLT_demo <-
CLT_demo %>%
rename(HH_inc = B19013_001E,
College = B15003_022E,
College_age_pop = B15003_001E) %>%
mutate(college_perc = College/College_age_pop) %>%
dplyr::select( -College, -College_age_pop)
# Adding neighborhood data
CLT_neighborhoods <-
st_read("https://github.com/mtdunst/Zestimate-Project/raw/main/School_districts.geojson") %>%
st_transform(st_crs(CLT_internal))
So far, we have added internal, amenities, and spatial structure variables. However, in order to build our model and analyze how these variables relate to home sales, we must modify them. We’ll achieve this using 2 techniques:
1. K-nearest neighbor (KNN): this will find the distance between a given home and the most near amenities (school, grocery store, park).
2. Spatial join (SJ): this will join our spatial structure data (educational attainment, neighborhoods) to our internal varies (Charlotte homes sales)
*# Most near school, grocery store, and park
CLT_internal <-
CLT_internal %>%
mutate(
school_nn1 = nn_function(st_coordinates(st_centroid(CLT_internal)), st_coordinates(st_centroid(CLT_schools)),k = 1),
grocery_nn1 = nn_function(st_coordinates(st_centroid(CLT_internal)), st_coordinates(st_centroid(CLT_grocery)), k = 1),
park_nn1 = nn_function(st_coordinates(st_centroid(CLT_internal)), st_coordinates(st_centroid(CLT_parks)), k = 1))
# Spatial join
CLT_internal <-
st_join(CLT_internal,CLT_demo)
CLT_internal <-
st_join(CLT_internal, CLT_neighborhoods)
Below are summary statistics tables for each variable category (internal, amenities,spatial structure).
#Internal variables
ss_internal <- CLT_internal
ss_internal <- st_drop_geometry(ss_internal)
ss_internal <- ss_internal %>%
dplyr::select("sqft","pricesqft", "totbaths", "yearbuilt")
stargazer(as.data.frame(ss_internal), type="text", digits=1, title = "Descriptive Statistics for Charlotte Metro Area Homes Internal Variables")
##
## Descriptive Statistics for Charlotte Metro Area Homes Internal Variables
## =======================================================
## Statistic N Mean St. Dev. Min Max
## -------------------------------------------------------
## sqft 46,183 77,176.9 4,648,606.0 0.0 425,034,086.0
## pricesqft 46,138 Inf.0 0.0 Inf.0
## totbaths 46,183 2.6 0.9 0.0 11.0
## yearbuilt 46,183 1,993.4 31.8 0 2,022
## -------------------------------------------------------
#Amenities
ss_amenities <- CLT_internal
ss_amenities <- st_drop_geometry(ss_amenities)
ss_amenities <- ss_amenities %>%
dplyr::select("school_nn1", "grocery_nn1", "park_nn1")
stargazer(as.data.frame(ss_amenities), type="text", digits=1, title = "Descriptive Statistics for Charlotte Metro Area Homes Amentity Variables")
##
## Descriptive Statistics for Charlotte Metro Area Homes Amentity Variables
## ================================================
## Statistic N Mean St. Dev. Min Max
## ------------------------------------------------
## school_nn1 46,183 3,266.8 1,390.4 8.2 8,369.4
## grocery_nn1 46,183 1,737.8 1,274.4 37.0 8,599.7
## park_nn1 46,183 1,239.8 767.7 21.0 5,026.3
## ------------------------------------------------
# Spatial Structure
ss_spatial <- CLT_internal
ss_spatial <- st_drop_geometry(ss_spatial)
ss_spatial <- ss_spatial %>%
dplyr::select("HH_inc", "college_perc", "FID")
stargazer(as.data.frame(ss_spatial), type="text", digits=1, title = "Descriptive Statistics for Charlotte Metro Area Homes Internal Variables")
##
## Descriptive Statistics for Charlotte Metro Area Homes Internal Variables
## ====================================================
## Statistic N Mean St. Dev. Min Max
## ----------------------------------------------------
## HH_inc 46,180 86,175.7 37,415.7 17,685 238,750
## college_perc 46,180 0.3 0.1 0.03 0.6
## FID 46,183 26.0 10.6 0 43
## ----------------------------------------------------
Below is a table visualizing correlation between our variables. We can see the home price maintains a postive correlation with the following variables (in order of strength):
We can use this matrix to inform our variable selection for our model.
numericVars <- select_if(st_drop_geometry(CLT_internal), is.numeric) %>% na.omit()
ggcorrplot(
round(cor(numericVars), 1),
p.mat = cor_pmat(numericVars),
colors = c("deepskyblue", "grey100", "firebrick1"),
type="lower",
insig = "blank") +
labs(title = "Correlation Matrix of Numeric Variables", tl.cex = 0.5, tl.col = "black") +
theme(axis.text.x = element_text(size = 10)) +
plotTheme()
Below are 4 home price correlation scatterplots based upon the results of our correlation matrix:
st_drop_geometry(CLT_internal) %>%
dplyr::select(price, quality, heatedarea, HH_inc, yearbuilt) %>%
filter(price < 10000000) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "hotpink") +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Price as a function of Internal and Spatial Variables") +
plotTheme()
Below are 4 maps including:
1. A map of our dependent variable (price)
2. A map of park locations
3. A map of nearby grocery stores
4. A map of school quality
*ggplot() +
geom_sf(data = CLT_neighborhoods, fill = "grey40") +
geom_sf(data = CLT_internal, aes(colour = q5(price)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(CLT_internal,"price"),
name="Quintile\nBreaks") +
labs(title="Home Price", subtitle="Charlotte Metro Area") +
labs(color = "Observed Sales Price (quintiles)") +
mapTheme()
ggplot() +
geom_sf(data = CLT_neighborhoods, fill = "grey70") +
geom_sf(data = CLT_internal, aes(colour = q5(price)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(CLT_internal,"price"),
name="Quintile\nBreaks") +
geom_sf(data = CLT_parks, color="darkgreen") +
labs(title="Park Locations vs Home Price", subtitle="Charlotte Metro Area") +
mapTheme()
ggplot() +
geom_sf(data = CLT_neighborhoods, fill = "grey30") +
geom_sf(data = CLT_internal, aes(colour = q5(price)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(CLT_internal,"price"),
name="Quintile\nBreaks") +
geom_sf(data = CLT_grocery, color="deepskyblue") +
labs(title="Grocery Store Locations vs Home Price", subtitle="Charlotte Metro Area") +
mapTheme()
ggplot() +
geom_sf(data = CLT_schools, aes(fill=factor(Quality), color=factor(Quality))) +
scale_fill_brewer(palette="RdYlGn") +
scale_color_brewer(palette="RdYlGn") +
labs(title="School Quality", subtitle="Niche.com ratings; Charlotte Metro Area") +
mapTheme()
Now that we have identified important variables, we will build out model by dividing the data into training/test sets. Our data will be partitioned as a 60/40 train-test split. After, we’ll run a regression on our training set (60%) and use the results to determine the generalizability with our ‘test’ data.
However, before dividing the dataset we will create categorical variables for the number of floors, bathrooms, and bedrooms for a given home.
#Re-engineering data as categorical: number of floors
CLT_internal<-
CLT_internal%>%
mutate(NUM_FLOORS.cat = ifelse(storyheigh == "1 STORY" | storyheigh == "1.5 STORY" | storyheigh == "SPLIT LEVEL" | storyheigh == "2.0 STORY", "Up to 2 Floors",
ifelse(storyheigh == "2.5 STORY" | storyheigh == "3.0 STORY", "Up to 3 Floors", "4+ Floors")))
#Re-engineer bedroom as categorical
CLT_internal <-
CLT_internal %>%
mutate(NUM_BEDS.cat = ifelse(bedrooms <= 2, "Up to 2 Bedrooms",
ifelse(bedrooms == 3 | bedrooms == 4, "Up to 4 Bedrooms", "5+ Bedrooms")))
#Re-engineer bathroom data as categorical
CLT_internal <-
CLT_internal %>%
mutate(NUM_BATHS.cat = ifelse(totbaths <= 2.5, "Up to 2.5 Bathrooms",
ifelse(totbaths <= 3.5 | totbaths <= 4.5, "Up to 4 Bathrooms", "5+ Bathrooms")))
#Creating training data
inTrain <- createDataPartition(
y = paste(CLT_internal$NUM_FLOORS.cat, CLT_internal$NUM_BEDS.cat),
p = .60, list = FALSE)
charlotte.training <- CLT_internal[inTrain,]
charlotte.test <- CLT_internal[-inTrain,]
reg.training <- lm(price ~ ., data = st_drop_geometry(charlotte.training) %>%
dplyr::select(price, heatedarea,
quality, NUM_FLOORS.cat,
NUM_BEDS.cat, NUM_BATHS.cat,
park_nn1, grocery_nn1,
age, HH_inc, college_perc))
summary(reg.training)
##
## Call:
## lm(formula = price ~ ., data = st_drop_geometry(charlotte.training) %>%
## dplyr::select(price, heatedarea, quality, NUM_FLOORS.cat,
## NUM_BEDS.cat, NUM_BATHS.cat, park_nn1, grocery_nn1, age,
## HH_inc, college_perc))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1700590 -84337 -8355 62079 43777952
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -294136.1494 39669.6022 -7.415
## heatedarea 136.5569 4.4684 30.561
## quality 197602.6787 5045.2573 39.166
## NUM_FLOORS.catUp to 2 Floors 58969.3580 21485.5243 2.745
## NUM_FLOORS.catUp to 3 Floors 35852.5896 27889.6130 1.286
## NUM_BEDS.catUp to 2 Bedrooms 54004.0648 15516.9983 3.480
## NUM_BEDS.catUp to 4 Bedrooms 39899.3091 10175.7475 3.921
## NUM_BATHS.catUp to 2.5 Bathrooms -431122.6701 24530.2029 -17.575
## NUM_BATHS.catUp to 4 Bathrooms -439686.2605 22236.6845 -19.773
## park_nn1 -4.2150 3.3004 -1.277
## grocery_nn1 -21.9551 2.2168 -9.904
## age 2294.5786 121.2829 18.919
## HH_inc 0.7781 0.1164 6.687
## college_perc 24732.5021 32799.3916 0.754
## Pr(>|t|)
## (Intercept) 0.000000000000126 ***
## heatedarea < 0.0000000000000002 ***
## quality < 0.0000000000000002 ***
## NUM_FLOORS.catUp to 2 Floors 0.006063 **
## NUM_FLOORS.catUp to 3 Floors 0.198623
## NUM_BEDS.catUp to 2 Bedrooms 0.000502 ***
## NUM_BEDS.catUp to 4 Bedrooms 0.000088405034962 ***
## NUM_BATHS.catUp to 2.5 Bathrooms < 0.0000000000000002 ***
## NUM_BATHS.catUp to 4 Bathrooms < 0.0000000000000002 ***
## park_nn1 0.201564
## grocery_nn1 < 0.0000000000000002 ***
## age < 0.0000000000000002 ***
## HH_inc 0.000000000023232 ***
## college_perc 0.450824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 395800 on 25554 degrees of freedom
## (2146 observations deleted due to missingness)
## Multiple R-squared: 0.3201, Adjusted R-squared: 0.3197
## F-statistic: 925.4 on 13 and 25554 DF, p-value: < 0.00000000000000022
To test the strength (accuracy) of our model in its ability to predict prices, we will:
#Creating predictions and calculating Mean Absolute Error (MAE) and Mean Absolute Percent Error (MAPE)
charlotte.test <-
charlotte.test %>%
mutate(price.Predict = predict(reg.training, charlotte.test),
price.Error = price.Predict - price,
price.AbsError = abs(price.Predict - price),
price.APE = (abs(price.Predict - price)) / price.Predict)%>%
filter(price < 5000000)
MAE <- mean(charlotte.test$price.AbsError, na.rm = T)
MAPE <- mean(charlotte.test$price.APE, na.rm = T)
reg.MAE.MAPE <-
cbind(MAE, MAPE) %>%
kable(caption = "Regression MAE & MAPE") %>%
kable_styling("hover",full_width = F)
reg.MAE.MAPE
| MAE | MAPE |
|---|---|
| 109529.1 | 0.2704347 |
#Cross-validation
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv <-
train(price ~ ., data = st_drop_geometry(CLT_internal) %>%
dplyr::select(price, heatedarea,
quality, NUM_FLOORS.cat,
NUM_BEDS.cat, NUM_BATHS.cat, grocery_nn1,
age, HH_inc, college_perc,
park_nn1),
method = "lm", trControl = fitControl, na.action = na.pass)
# Table
stargazer(as.data.frame(reg.cv$resample), type="text", digits=1, title="Cross Validation Results")
##
## Cross Validation Results
## =======================================================
## Statistic N Mean St. Dev. Min Max
## -------------------------------------------------------
## RMSE 100 268,328.6 296,854.0 133,413.4 2,137,604.0
## Rsquared 100 0.6 0.2 0.01 0.8
## MAE 100 114,163.2 17,868.1 93,420.4 207,393.9
## -------------------------------------------------------
# Plot
ggplot(reg.cv$resample, aes(x=MAE)) +
geom_histogram(fill = "darkgreen") +
labs(title = "Count of Mean Average Error During Cross-Validation") +
xlab("MAE")+
ylab("Count")+
plotTheme()
#Visualizing prediction errors
charlotte.APE <- charlotte.test[c(6,36,43:46,51,54)]
charlotte_APE.sf <-
charlotte.APE %>%
filter(price.APE > 0) %>%
st_as_sf(sf_column_name=geometry) %>%
st_transform('ESRI:103500')
grid.arrange(ncol=2,
ggplot() +
geom_sf(data = CLT_neighborhoods, fill = "grey40") +
geom_sf(data = charlotte_APE.sf, aes(color = q5(price.Predict)), size = .25) +
scale_colour_manual(values = palette5,
labels=qBr(charlotte.APE,"price"),
name="Quintile\nBreaks") +
labs(title="Predicted Sales Price") +
mapTheme(),
ggplot() +
geom_sf(data = CLT_neighborhoods, fill = "grey40") +
geom_sf(data = charlotte_APE.sf, aes(color = price.APE), size = .25) +
labs(title="Predicted Sales Price\nAbsolute Percent Error") +
binned_scale(aesthetics = "color",
scale_name = "stepsn",
palette = function(x) c("#1a9641", "#a6d96a", "#ffffbf", "#fdae61", "#d7191c"),
breaks = c(0.10, 0.20, 0.5, 0.75),
limits = c(0, 50),
show.limits = TRUE,
guide = "colorsteps"
) +
mapTheme())
#Predicted vs observed sales price
ggplot(
charlotte_APE.sf, aes(price, price.Predict, col = price.APE)) +
binned_scale(aesthetics = "color",
scale_name = "stepsn",
palette = function(x) c("#1a9641", "#a6d96a", "#ffffbf", "#fdae61", "#d7191c"),
breaks = c(0.10, 0.20, 0.5, 0.75),
limits = c(0, 50),
show.limits = TRUE,
guide = "colorsteps") +
geom_point(size=1) +
scale_y_continuous(limits = c(0, 4000000)) +
scale_x_continuous(limits = c(0, 4000000)) +
labs(title="Sales Price vs. Predicted", subtitle="Charlotte Metro Area") +
ylab("Predicted Sales Price (in dollars)") +
xlab("Observed Sales Price (in dollars)") +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black", size = 0.5) +
labs(color = "Absolute % Error") +
geom_label(
label="0% error line",
x=3500000,
y=3000000,
label.padding = unit(0.25, "lines"), # Rectangle size around label
label.size = 0.15,
color = "black",
fill="grey80")
By calculating Moran’s I for this dataset, we are determining if there is a spatial autocorrelation. relationship among home sales in Charlotte. As seen in the outcome, Moran’s I was calculated to be high, meaning there is a spatial relationship in the predicted errors that must be accounted for.
#Calculating Moran's I
coords.test <- st_coordinates(charlotte.test)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
charlotte.test %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error, NAOK = TRUE))
## Simple feature collection with 18455 features and 54 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 422146 ymin: 141808.1 xmax: 466272.4 ymax: 196797.3
## Projected CRS: NAD_1983_CORS96_StatePlane_North_Carolina_FIPS_3200
## First 10 features:
## pid totalac codemunici municipali zipcode price yearbuilt
## 1 00101333 0.5000 <NA> HUNTERSVILLE 28078 1315000 2005
## 2 00101334 0.5000 <NA> HUNTERSVILLE 28078 925000 1997
## 3 00101337 0.4155 <NA> HUNTERSVILLE 45601 2523000 2018
## 4 00101341 0.0000 6 HUNTERSVILLE 28078 250000 1980
## 5 00101433 0.0000 6 HUNTERSVILLE 28078 675000 2003
## 6 00101461 0.0000 6 HUNTERSVILLE 28078 422000 1992
## 7 00102218 2.4400 6 HUNTERSVILLE 28078 272000 1910
## 8 00102221 0.0000 6 HUNTERSVILLE 28078 587500 1988
## 9 00102318 1.2820 6 HUNTERSVILLE 28081 819500 1976
## 10 00102320 0.5950 6 HUNTERSVILLE 28202 738000 1965
## heatedarea cdebuildin descbuildi storyheigh aheatingty heatedfuel actype
## 1 6187 01 RES 2.0 STORY AIR-DUCTED GAS AC-CENTRAL
## 2 2736 01 RES 1.5 STORY AIR-DUCTED GAS AC-CENTRAL
## 3 5180 01 RES 1.5 STORY AIR-DUCTED GAS AC-CENTRAL
## 4 1242 01 RES 1 STORY AIR-DUCTED ELECTRIC AC-CENTRAL
## 5 2716 01 RES 1 STORY AIR-DUCTED GAS AC-CENTRAL
## 6 2522 01 RES 1.5 STORY AIR-DUCTED GAS AC-CENTRAL
## 7 1312 01 RES 1 STORY AIR-DUCTED GAS AC-NONE
## 8 2870 01 RES 2.0 STORY HEAT PUMP ELECTRIC AC-CENTRAL
## 9 3871 01 RES 1 STORY AIR-DUCTED ELECTRIC AC-CENTRAL
## 10 3400 01 RES 2.0 STORY AIR-DUCTED ELECTRIC AC-CENTRAL
## extwall foundation numfirepla fireplaces bldggrade fullbaths
## 1 STUCCO HRDCT CRAWL SPACE 1 FP4 EXCELLENT 6
## 2 FACE BRICK CRAWL SPACE 1 FP2 GOOD 2
## 3 HARDIPLK/DSGN VINYL CRAWL SPACE 1 FP2 EXCELLENT 4
## 4 FACE BRICK CRAWL SPACE 0 <NA> AVERAGE 1
## 5 ALUM,VINYL CRAWL SPACE 1 FP2 GOOD 2
## 6 FACE BRICK CRAWL SPACE 1 FP4 GOOD 2
## 7 WOOD ON SHTG CRAWL SPACE 1 FP3 AVERAGE 1
## 8 WOOD ON SHTG CRAWL SPACE 1 FP3 GOOD 2
## 9 FACE BRICK CRAWL SPACE 1 FP3 AVERAGE 3
## 10 WOOD SHINGLE CRAWL SPACE 1 FP3 GOOD 3
## halfbaths bedrooms units landusecod physicalde propertyus descproper
## 1 1 7 1 R122 AV 01 Single-Family
## 2 1 3 1 R122 AV 01 Single-Family
## 3 1 5 1 R122 AV 01 Single-Family
## 4 1 3 1 R100 AV 01 Single-Family
## 5 1 3 1 R100 AV 01 Single-Family
## 6 1 4 1 R100 AV 01 Single-Family
## 7 0 2 1 R120 AV 01 Single-Family
## 8 1 3 1 R100 AV 01 Single-Family
## 9 0 3 1 R122 AV 01 Single-Family
## 10 0 4 1 R122 AV 01 Single-Family
## shape_Leng shape_Area musaID toPredict age sqft pricesqft totbaths
## 1 630.5433 19277.33 7 MODELLING 17 21780.00 0.016562738 6.5
## 2 633.6883 23621.33 8 MODELLING 25 21780.00 0.023545946 2.5
## 3 618.1447 18101.26 9 MODELLING 4 18099.18 0.007173674 4.5
## 4 565.5071 18096.04 10 MODELLING 42 0.00 0.000000000 1.5
## 5 686.9222 19762.49 14 MODELLING 19 0.00 0.000000000 2.5
## 6 575.6071 20075.25 16 MODELLING 30 0.00 0.000000000 2.5
## 7 1475.0216 102353.15 21 MODELLING 112 106286.40 0.390758824 1.0
## 8 650.2860 25235.96 22 MODELLING 34 0.00 0.000000000 2.5
## 9 942.4561 55826.73 26 MODELLING 46 55843.92 0.068143893 3.0
## 10 702.0724 25346.26 27 MODELLING 57 25918.20 0.035119512 3.0
## quality school_nn1 grocery_nn1 park_nn1 GEOID HH_inc college_perc FID
## 1 6 6976.639 3974.910 2579.115 37119006220 92038 0.293934 24
## 2 4 7157.136 3970.221 2737.943 37119006220 92038 0.293934 24
## 3 6 7101.836 3929.905 2715.030 37119006220 92038 0.293934 24
## 4 3 6951.205 4149.446 2444.677 37119006220 92038 0.293934 24
## 5 4 6326.410 3685.299 2338.371 37119006220 92038 0.293934 24
## 6 4 6150.081 3232.795 2156.501 37119006220 92038 0.293934 24
## 7 3 5882.428 2654.306 1504.839 37119006220 92038 0.293934 24
## 8 4 6186.072 2484.837 1378.492 37119006220 92038 0.293934 24
## 9 3 5861.251 2384.744 1175.440 37119006220 92038 0.293934 24
## 10 4 5941.004 2294.824 1129.048 37119006220 92038 0.293934 24
## MIDD_NAME Shape_Leng Shape_Area geometry NUM_FLOORS.cat
## 1 Bradley 2486536 1051124394 POINT (433648.4 188499) Up to 2 Floors
## 2 Bradley 2486536 1051124394 POINT (433609.9 188678) Up to 2 Floors
## 3 Bradley 2486536 1051124394 POINT (433660.1 188638.3) Up to 2 Floors
## 4 Bradley 2486536 1051124394 POINT (433492.7 188406.7) Up to 2 Floors
## 5 Bradley 2486536 1051124394 POINT (434138.4 187989) Up to 2 Floors
## 6 Bradley 2486536 1051124394 POINT (434656.6 187965) Up to 2 Floors
## 7 Bradley 2486536 1051124394 POINT (435429.3 187849.1) Up to 2 Floors
## 8 Bradley 2486536 1051124394 POINT (435415.7 188153.8) Up to 2 Floors
## 9 Bradley 2486536 1051124394 POINT (435774.5 187865.8) Up to 2 Floors
## 10 Bradley 2486536 1051124394 POINT (435819.2 187949.2) Up to 2 Floors
## NUM_BEDS.cat NUM_BATHS.cat price.Predict price.Error
## 1 5+ Bedrooms 5+ Bathrooms 1815081.0 500081.00
## 2 Up to 4 Bedrooms Up to 2.5 Bathrooms 575184.5 -349815.51
## 3 5+ Bedrooms Up to 4 Bathrooms 1208467.6 -1314532.39
## 4 Up to 4 Bedrooms Up to 2.5 Bathrooms 209874.8 -40125.17
## 5 Up to 4 Bedrooms Up to 2.5 Bathrooms 566625.6 -108374.42
## 6 Up to 4 Bedrooms Up to 2.5 Bathrooms 576075.2 154075.23
## 7 Up to 2 Bedrooms Up to 2.5 Bathrooms 430946.5 158946.46
## 8 Up to 4 Bedrooms Up to 2.5 Bathrooms 652476.2 64976.16
## 9 Up to 4 Bedrooms Up to 4 Bathrooms 613591.7 -205908.28
## 10 Up to 4 Bedrooms Up to 4 Bathrooms 774286.2 36286.22
## price.AbsError price.APE lagPriceError
## 1 500081.00 0.27551443 -384842.62
## 2 349815.51 0.60817967 -214863.32
## 3 1314532.39 1.08776800 -21919.94
## 4 40125.17 0.19118617 -270213.69
## 5 108374.42 0.19126285 NA
## 6 154075.23 0.26745679 NA
## 7 158946.46 0.36883111 -127849.90
## 8 64976.16 0.09958396 -201622.99
## 9 205908.28 0.33557864 -54878.95
## 10 36286.22 0.04686410 NA
moranTest <- moran.mc(charlotte.test$price.Error,
spatialWeights.test, nsim = 999, na.action=na.exclude, , zero.policy = TRUE)
# Observed and permuted Moran's I
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in orange",
x="Moran's I",
y="Count") +
plotTheme()
We introduce neighborhoods into the model in an attempt to account for spatial bias in our predictions. Specifically, we have included “Census Block Groups” as it was readily available information. The addition of this variable increases the R-squared of our model, meaning it is able to explain more of the observed variance with neighborhoods included than without it.
#Adjusting for neighborhod
reg.nhood <- lm(price ~ ., data = as.data.frame(charlotte.training) %>%
dplyr::select(price, heatedarea,
quality, NUM_FLOORS.cat,
NUM_BEDS.cat, NUM_BATHS.cat,
park_nn1, grocery_nn1,
age, HH_inc, college_perc))
summary(reg.nhood)
##
## Call:
## lm(formula = price ~ ., data = as.data.frame(charlotte.training) %>%
## dplyr::select(price, heatedarea, quality, NUM_FLOORS.cat,
## NUM_BEDS.cat, NUM_BATHS.cat, park_nn1, grocery_nn1, age,
## HH_inc, college_perc))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1700590 -84337 -8355 62079 43777952
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -294136.1494 39669.6022 -7.415
## heatedarea 136.5569 4.4684 30.561
## quality 197602.6787 5045.2573 39.166
## NUM_FLOORS.catUp to 2 Floors 58969.3580 21485.5243 2.745
## NUM_FLOORS.catUp to 3 Floors 35852.5896 27889.6130 1.286
## NUM_BEDS.catUp to 2 Bedrooms 54004.0648 15516.9983 3.480
## NUM_BEDS.catUp to 4 Bedrooms 39899.3091 10175.7475 3.921
## NUM_BATHS.catUp to 2.5 Bathrooms -431122.6701 24530.2029 -17.575
## NUM_BATHS.catUp to 4 Bathrooms -439686.2605 22236.6845 -19.773
## park_nn1 -4.2150 3.3004 -1.277
## grocery_nn1 -21.9551 2.2168 -9.904
## age 2294.5786 121.2829 18.919
## HH_inc 0.7781 0.1164 6.687
## college_perc 24732.5021 32799.3916 0.754
## Pr(>|t|)
## (Intercept) 0.000000000000126 ***
## heatedarea < 0.0000000000000002 ***
## quality < 0.0000000000000002 ***
## NUM_FLOORS.catUp to 2 Floors 0.006063 **
## NUM_FLOORS.catUp to 3 Floors 0.198623
## NUM_BEDS.catUp to 2 Bedrooms 0.000502 ***
## NUM_BEDS.catUp to 4 Bedrooms 0.000088405034962 ***
## NUM_BATHS.catUp to 2.5 Bathrooms < 0.0000000000000002 ***
## NUM_BATHS.catUp to 4 Bathrooms < 0.0000000000000002 ***
## park_nn1 0.201564
## grocery_nn1 < 0.0000000000000002 ***
## age < 0.0000000000000002 ***
## HH_inc 0.000000000023232 ***
## college_perc 0.450824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 395800 on 25554 degrees of freedom
## (2146 observations deleted due to missingness)
## Multiple R-squared: 0.3201, Adjusted R-squared: 0.3197
## F-statistic: 925.4 on 13 and 25554 DF, p-value: < 0.00000000000000022
charlotte.test.nhood <-
charlotte.test %>%
mutate(Regression = "Neighborhood Effects",
price.Predict = predict(reg.nhood, charlotte.test),
price.Error = price.Predict- price,
price.AbsError = abs(price.Predict- price),
price.APE = (abs(price.Predict- price)) / price)%>%
filter(price < 5000000)
charlotte.test <-charlotte.test %>%
mutate(Regression = "Baseline")
sales_predictions.sf <- CLT_internal %>%
mutate(price.Predict = predict(reg.nhood, CLT_internal)) %>%
filter(toPredict == "CHALLENGE")
sales_predictions.df <- as.data.frame(st_drop_geometry(sales_predictions.sf))
sales_predictions.df <- sales_predictions.df[c(30,43)]
write.csv(sales_predictions.df,"Quisqueyanes.csv", row.names = FALSE)
bothRegressions <-
rbind(
dplyr::select(charlotte.test, starts_with("price"), Regression, MIDD_NAME) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error, NAOK=TRUE)),
dplyr::select(charlotte.test.nhood, starts_with("price"), Regression, MIDD_NAME) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error, NAOK=TRUE)))
While controlling for neighborhood improved our model, the difference is small and hard to notice visually like this.
#Neighborhood effect results
bothRegressions %>%
dplyr::select(price.Predict, price, Regression) %>%
ggplot(aes(price, price.Predict)) +
geom_point() +
stat_smooth(aes(price, price),
method = "lm", se = FALSE, size = 1, colour="#FA7800") +
stat_smooth(aes(price.Predict, price),
method = "lm", se = FALSE, size = 1, colour="#25CB10") +
facet_wrap(~Regression) +
labs(title="Predicted sale price as a function of observed price",
subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
plotTheme()
#Scatter plot of MAPE by neighborhood mean price
npa.mean.sf <- charlotte.test %>%
drop_na(price.APE) %>%
group_by(MIDD_NAME) %>%
summarise(mean_APE = mean(price.APE))
npa.price.sf <- charlotte.test %>%
drop_na(price.APE) %>%
group_by(MIDD_NAME) %>%
summarise(mean_Price = mean(price.Predict))
MAPE_by_NPA <- merge(st_drop_geometry(npa.mean.sf), npa.price.sf, by="MIDD_NAME")
ggplot(MAPE_by_NPA, aes(mean_Price, mean_APE))+
geom_jitter(height=2, width=2)+
ylim(-5,5)+
geom_smooth(method = "lm", aes(mean_Price, mean_APE), se = FALSE, colour = "red") +
labs(title = "MAPE by Neighborhood Mean Sales Price",
x = "Mean Home Price", y = "MAPE") +
plotTheme()
# ggplot(npa.mean.sf, aes(x=MIDD_NAME, y=mean_APE)) +
# geom_point(alpha=0.5) +
# labs(title="Sales Price vs. Prediction Error", subtitle="Charlotte Area Home Sales") +
# ylab("Mean Absolute % Error") +
# xlab("Observed Sales Price") +
# theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
st_drop_geometry(charlotte.test) %>%
group_by(MIDD_NAME) %>%
summarize(MAPE = mean(price.APE, na.rm = T)) %>%
ungroup() %>%
left_join(CLT_neighborhoods) %>%
st_sf() %>%
ggplot() +
geom_sf(aes(fill = MAPE), color=NA) +
scale_fill_gradient(low = palette5[5], high = palette5[1],
name = "MAPE") +
geom_sf(data = charlotte.test, colour = "black", show.legend = "point", size = .05) +
labs(title = "Mean test set MAPE by Block Groups", subtitle="Charlotte Metro Area") +
mapTheme()
Our variables are able to explain ~50% of the variation observed in Charlotte home prices (R squared). Our mean absolute error (MAE) is 106593.9, indicating that on average, our model’s price predictions differ from actual home prices by ~$106,593. This is fair and may be due to outlier (costly homes) included in our dataset. Ultimately, the model is generalizeable, but requires some tweaks before our Zillow pitch meeting :)
Overall, our model is sufficient and can be strengthened with a few modification. For example, for this project we were not allowed to use sales data towards our algorithm. This rule added a level of difficulty because sales data is a strong predictor towards home valuation (e.g. the sales’ price of your neighbor’s home is likely very similar to the sales price of your home). We were able to make up for this by emphasizing other variables such as # of floors, bedrooms, bathrooms,housing quality, the quality of nearby schools, as well as a home’s proximity to nearby grocery stores.